This notebook demonstrates how to use the PyDrill module to connect to Apache Drill and query data. The complete documentation for PyDrill can be found at http://pydrill.readthedocs.io
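The essential steps are: install PyDrill, import the module, open a connection to Drill, execute a query, and work with the results.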
You will first need to install PyDrill. This can be done by opening a terminal and typing:
pip install pydrill
After you've done this, you will be able to import the PyDrill module.
In [1]:
from pydrill.client import PyDrill
In [2]:
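# ImproperlyConfigured is defined in pydrill.exceptions and must be imported before it is raised.
from pydrill.exceptions import ImproperlyConfigured

# Open a connection to Drill's REST interface.
drill = PyDrill(host='localhost', port=8047)

# Verify the connection is active; raise an error if not.
if not drill.is_active():
    raise ImproperlyConfigured('Please run Drill first')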
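If is_active() reports that Drill is not running, it can help to check whether anything is listening on the web port at all. The following is a minimal sketch using only the Python standard library. The helper port_is_open is a hypothetical name introduced here for illustration and is not part of PyDrill; the host and port are the ones used in the connection above.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds, else False.

    Hypothetical helper for illustration; not part of PyDrill.
    """
    try:
        # create_connection raises OSError if the port refuses or times out.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Same host and port as the PyDrill connection above.
    print(port_is_open("localhost", 8047))
```

A False result suggests Drill is not started or is bound to a different port. A True result only shows that something accepts connections there, so is_active() remains the real check.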
In [3]:
# Execute the query in Drill
query_result = drill.query('''
    SELECT JobTitle,
           AVG( CAST( LTRIM( AnnualSalary, '$' ) AS FLOAT) ) AS avg_salary,
           COUNT( DISTINCT name ) AS number
    FROM dfs.drillworkshop.`*.csvh`
    GROUP BY JobTitle
    ORDER BY avg_salary DESC
    LIMIT 10
''')

# Iterate through the rows.
for row in query_result:
    print( row )
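To make clear what the query above computes, here is a rough pure-Python equivalent that runs without a Drill server. It is only a sketch: the sample rows are invented for illustration (only the column names name, JobTitle and AnnualSalary come from the query), and summarize is a hypothetical helper, not a PyDrill function.

```python
from collections import defaultdict

# Hypothetical sample rows; the values are invented, the keys match the query.
rows = [
    {"name": "Ann", "JobTitle": "Engineer", "AnnualSalary": "$90000.00"},
    {"name": "Bob", "JobTitle": "Engineer", "AnnualSalary": "$70000.00"},
    {"name": "Cy", "JobTitle": "Clerk", "AnnualSalary": "$40000.00"},
]


def summarize(rows, limit=10):
    """Mimic the GROUP BY / AVG / COUNT(DISTINCT) / ORDER BY / LIMIT query."""
    salaries = defaultdict(list)
    names = defaultdict(set)
    for row in rows:
        title = row["JobTitle"]
        # LTRIM(AnnualSalary, '$') followed by CAST(... AS FLOAT)
        salaries[title].append(float(row["AnnualSalary"].lstrip("$")))
        # A set keeps only distinct names, like COUNT(DISTINCT name)
        names[title].add(row["name"])
    summary = [
        {
            "JobTitle": title,
            "avg_salary": sum(values) / len(values),
            "number": len(names[title]),
        }
        for title, values in salaries.items()
    ]
    # ORDER BY avg_salary DESC, then LIMIT
    summary.sort(key=lambda r: r["avg_salary"], reverse=True)
    return summary[:limit]


for row in summarize(rows):
    print(row)
```

The salary column is stored as text with a leading dollar sign, which is why the query has to strip the sign and cast the value before averaging.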
In [5]:
# Convert the query result to a pandas DataFrame (requires pandas) and preview the first rows.
df = query_result.to_dataframe()
df.head()